fix(geometry): retry point-in-polygon with finer batching on GPU OOM by EliHei2 · Pull Request #62 · dpeerlab/segger

EliHei2 · 2026-06-01T13:10:12Z

The batched quadtree point-in-polygon join could OOM on very large inputs (notably MERSCOPE, with millions of transcripts) and crash the run. Wrap the batch loop so a CUDA out-of-memory error retries with progressively finer batching (doubling the batch count up to 256) before giving up, and return an empty match frame when there are no results instead of failing in cudf.concat.

What to review: the OOM detection (only OOM errors trigger a retry; non-OOM errors are re-raised immediately) and the bounded retry in _points_in_polygons_contains. Keeps main's quadtree API; no change to the happy path.
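For readers without the diff at hand, the retry described above can be sketched roughly as follows. This is an illustrative, CPU-only sketch, not the code in this PR: `run_with_finer_batches`, `is_oom_error`, and `process_batch` are hypothetical names, plain Python lists stand in for cuDF frames, and the message-based OOM check is an assumption about how such errors might be recognised.

```python
from typing import Callable, List, Sequence

MAX_BATCHES = 256  # upper bound on the batch count, as stated in the PR description


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic (assumption): a MemoryError, or any error whose message
    mentions out-of-memory or bad_alloc, counts as an OOM error."""
    if isinstance(exc, MemoryError):
        return True
    msg = str(exc).lower()
    return "out of memory" in msg or "bad_alloc" in msg


def run_with_finer_batches(
    points: Sequence,
    process_batch: Callable[[Sequence], List],
    n_batches: int = 1,
    max_batches: int = MAX_BATCHES,
) -> List:
    """Process `points` in batches; on an OOM error, double the batch count
    and start over, giving up once `max_batches` has been tried."""
    while True:
        # Ceiling division, with a floor of one point per batch.
        size = max(1, -(-len(points) // n_batches))
        try:
            results: List = []
            for start in range(0, len(points), size):
                results.extend(process_batch(points[start:start + size]))
            # Empty input yields an empty result instead of an error.
            return results
        except Exception as exc:
            if not is_oom_error(exc):
                raise  # non-OOM errors propagate immediately
            if n_batches >= max_batches:
                raise  # already at the finest allowed batching
            n_batches = min(2 * n_batches, max_batches)
```

Restarting the whole loop after a failure, rather than resuming from the failed batch, keeps the sketch simple at the cost of redoing work; whether the PR makes the same trade-off is not stated in the description.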


Tobiaspk · 2026-06-08T16:06:03Z

Good catch, down to merge if it makes things more robust. Two things first:

Do you have a log showing that points-in-polygon is really the failure? The OOMs I've seen on our SLURM cluster often show up as oom_kill events or C++ std::bad_alloc (which is a RuntimeError), and those happen outside points-in-polygon for the 1B-transcript Atera dataset. As a side note, I don't think we could even catch an oom_kill here, since it is a SIGKILL sent by Slurm. Our out-of-memory logs typically look like:

/var/spool/slurmd/job11412228/slurm_script: line 21: 4167901 Killed                  segger segment --input-directory /data1/collab002/sail/projects/ongoing/segger_dev/data/inputs/WTA_Preview_FFPE_Breast_Cancer --output-directory /data1/collab002/sail/projects/ongoing/segger_dev/data/outputs/WTA_Preview_FFPE_Breast_Cancer_v3 --tiling-margin-training 5.0 --tiling-margin-prediction 5.0 --debug
[2026-05-27T20:15:16.212] error: Detected 1 oom_kill event in StepId=11412228.batch. Some of the step tasks have been OOM Killed.

Could you share a log showing it crash in points-in-polygon on your end?

Going forward, similar to "fix(tiling): fall back to a smaller margin instead of dropping tiles" (#61), let's try to estimate the max batch size up front rather than rely on error catching (which admittedly will be hard on GPU). Alternatively, we could try configuring the RMM and CuPy allocators to avoid this out of the box.
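As a rough illustration of the up-front estimate, something like the sketch below could work. It is only a sketch under stated assumptions: `estimate_n_batches`, `bytes_per_point`, and `safety_factor` are hypothetical, and the real per-point peak memory of the quadtree join would have to be measured, since it is not known from this thread.

```python
import math


def estimate_n_batches(
    n_points: int,
    bytes_per_point: int,
    free_bytes: int,
    safety_factor: float = 0.5,
) -> int:
    """Estimate how many batches are needed so that each batch fits in a
    fraction (`safety_factor`) of the currently free device memory.

    `bytes_per_point` is a hypothetical, measured peak-memory cost per point.
    On a GPU, `free_bytes` could be read from
    cupy.cuda.runtime.memGetInfo()[0]; it is passed in here so the function
    stays testable without a device.
    """
    if n_points == 0:
        return 1
    budget = free_bytes * safety_factor
    # Always allow at least one point per batch, even with a tiny budget.
    max_points_per_batch = max(1, int(budget // bytes_per_point))
    return math.ceil(n_points / max_points_per_batch)


# Example: 10M points at an assumed 64 bytes each, with 1 GiB free.
n_batches = estimate_n_batches(10_000_000, 64, 1 << 30)
```

An estimate like this would not replace error handling entirely, because other allocations can shrink the free memory between the query and the join.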

Happy to merge in the meantime, but I would like to confirm it's really points-in-polygon first.

EliHei2 assigned Tobiaspk Jun 1, 2026

EliHei2 wants to merge 1 commit into main from bugfix/spatial-join-oom
